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Abstract 



The problem of topic modeling can be seen as a generalization of the clustering problem, in 
• that it posits that observations are generated due to multiple latent factors (e.g., the words in 

! each document are generated as a mixture of several active topics, as opposed to just one). This 

r/} ■ increased representational power comes at the cost of a more challenging unsupervised learning 

problem of estimating the topic probability vectors (the distributions over words for each topic) , 
when only the words are observed and the corresponding topics are hidden. 

We provide a simple and efficient learning procedure that is guaranteed to recover the pa- 
rameters for a wide class of mixture models, including the popular latent Dirichlet allocation 
(LDA) model. For LDA, the procedure correctly recovers both the topic probability vectors 
and the prior over the topics, using only trigram statistics (i.e., third order moments, which 



\Q \ may be estimated with documents containing just three words). The method, termed Excess 



Correlation Analysis (EC A) , is based on a spectral decomposition of low order moments (third 
| and fourth order) via two singular value decompositions (SVDs). Moreover, the algorithm is 

£SJ ■ scalable since the SVD operations arc carried out on k x k matrices, where k is the number 

of latent factors (e.g. the number of topics), rather than in the ti-dimensional observed space 
(typically d^> k). 

X 

b ■ 1 Introduction 

There is general agreement that there are multiple unobserved or latent factors affecting observed 
data. Mixture models offer a powerful framework to incorporate the effects of these latent variables. 
A family of mixture models, popularly known as topic models, has generated broad interest on both 
theoretical and practical fronts. 

Topic models incorporate latent variables, the topics, to explain the observed co-occurrences 
of words in documents. They posit that each document has a mixture of active topics (possibly 
sparse) and that each active topic determines the occurrence of words in the document. Usually, 
a Dirichlet prior is assigned to the distribution of topics in documents, giving rise to the so-called 
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latent Dirichlet allocation (LDA) (Blei et al., 2003). These models possess a rich representational 
power since they allow for the words in each document to be generated from more than one topic 
(i.e., the model permits documents to be about multiple topics). This increased representational 
power comes at the cost of a more challenging unsupervised estimation problem, when only the 
words are observed and the corresponding topics are hidden. 

In practice, the most common estimation procedures are based on finding maximum likeli- 
hood (ML) estimates, through either local search or sampling based methods, e.g., Expectation- 
Maximization (EM) (Redner and Walker, 1984), Gibbs sampling (Asuncion et al., 2011), and 
variational approaches (Hoffman et al., 2010). Another body of tools is based on matrix factoriza- 
tion (Hofmann, 1999; Lee and Seung, 1999). For document modeling, typically, the goal is to form 
a sparse decomposition of a term by document matrix (which represents the word counts in each 
document) into two parts: one which specifies the active topics in each document and the other 
which specifies the distributions of words under each topic. 

This work provides an alternative approach to parameter recovery based on the method of mo- 
ments (Lindsay, 1989; Lindsay and Basak, 1993), which attempts to match the observed moments 
with those posited by the model. Our approach does this efficiently through a spectral decompo- 
sition of the observed moments through two singular value decompositions. This method is simple 
and efficient to implement, based on only low order moments (third or fourth order), and is guar- 
anteed to recover the parameters of a wide class of mixture models, including the LDA model. We 
exploit exchangeability of the observed variables and, more generally, the availability of multiple 
views drawn independently from the same hidden component. 

1.1 Summary of Contributions 

We present an approach known as Excess Correlation Analysis (ECA) based on the knowledge of 
low order moments between the observed variables, assumed to be exchangeable (or, more generally, 
drawn from a multi-view mixture model). ECA differs from Principal Component Analysis (PC A) 
and Canonical Correlation Analysis (CCA) in that it is based on two singular value decompositions: 
the first SVD whitens the data (based on the correlation between two variables) and the second 
SVD utilizes higher order moments (based on third or fourth order) to find directions which exhibit 
moments that are in excess of those suggested by a Gaussian distribution. Both SVDs are performed 
on matrices of size k x k, where k is the number of latent factors, making the algorithm scalable 
(typically the dimension of the observed space d ^> k). 

The method is applicable to a wide class of mixture models including exchangeable and multi- 
view models. We first consider the class of exchangeable variables with independent latent factors, 
such as a latent Poisson mixture model (a natural Poisson model for generating the sentences in a 
document, analogous to LDA's multinomial model for generating the words in a document). We 
establish that a spectral decomposition, based on third or fourth order central moments, recovers 
the parameters for this model class. We then consider latent Dirichlet allocation and show that a 
spectral decomposition of a modified third order moment (exactly) recovers both the probability 
distributions over words for each topic and the Dirichlet prior. Note that to obtain third order 
moments, it suffices for documents to contain just 3 words. Finally, we present extensions to multi- 
view models, where multiple views drawn independently from the same latent factor are available. 
This includes the case of both pure topic models (where only one active topic is present in each 
document) and discrete hidden Markov models. For this setting, we establish that ECA correctly 
recovers the parameters and is simpler than the eigenvector decomposition methods of Anandkumar 
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et al. (2012). 

Finally, "plug-in" moment estimates can be used with sampled data. Section 5 provides a sample 
complexity of the method showing that estimating the third order moments is not as difficult as it 
might naively seem since we only need a k x k matrix to be accurate. 

Some preliminary experiments that illustrate the efficacy of the proposed algorithm are given 
in the appendix. 

1.2 Related Work 

For the case of a single topic per document, the work of Papadimitriou et al. (2000) provides the first 
guarantees of recovering the topic distributions (i.e., the distributions over words corresponding to 
each topic), albeit with a rather stringent separation condition (where the words in each topic are 
essentially non overlapping). Understanding what separation conditions (or lack thereof) permit 
efficient learning is a natural question; in the clustering literature, a line of work has focussed on 
understanding the relation between the separation of the mixture components and the complexity of 
learning. For clustering, the first learnability result (Dasgupta, 1999) was under a somewhat strong 
separation condition; a subsequent line of results relaxed (Arora and Kannan, 2001; Dasgupta and 
Schulman, 2007; Vempala and Wang, 2002; Kannan et al., 2005; Achlioptas and McSherry, 2005; 
Chaudhuri and Rao, 2008; Brubaker and Vempala, 2008; Chaudhuri et al., 2009) or removed these 
conditions (Kalai et al., 2010; Belkin and Sinha, 2010; Moitra and Valiant, 2010); roughly speaking, 
the less stringent the separation condition assumed, the more difficult the learning problem is, both 
computationally and statistically. For the topic modeling problem in which only a single topic is 
present per document, Anandkumar et al. (2012) provides an algorithm for learning topics with no 
separation (only a certain full rank assumption is utilized). 

For the case of latent Dirichlet allocation (where multiple topics are present in each document), 
the recent work of Arora et al. (2012) provides the first provable result under a certain natural 
separation condition. The notion of separation utilized is based on the existence of "anchor words" 
for topics — essentially , each topic contains words that appear (with reasonable probability) only 
in that topic (this is a milder assumption than that in Papadimitriou et al. (2000)). Under this 
assumption, Arora et al. (2012) provide the first provably correct algorithm for learning the topic 
distributions. Their work also justifies the use of non-negative matrix (NMF) as a procedure for 
this problem (the original motivation for NMF was as a topic modeling algorithm, though, prior 
to this work, formal guarantees as such were rather limited). Furthermore, Arora et al. (2012) 
provides results for certain correlated topic models. 

Our approach makes further progress on this problem by providing an algorithm which requires 
no separation condition. The underlying approach we take is a certain diagonalization technique 
of the observed moments. We know of at least three different settings which utilize this idea for 
parameter estimation. 

Chang (1996) utilizes eigenvector methods for discrete Markov models of evolution, where the 
models involve multinomial distributions. The idea has been extended to other discrete mixture 
models such as discrete hidden Markov models (HMMs) and mixture models with single active 
topics (see Mossel and Roch (2006); Hsu et al. (2009); Anandkumar et al. (2012)). A key idea in 
Chang (1996) is the ability to handle multinomial distributions, which comes at the cost of being 
able to handle only certain single latent factor/topic models (where the latent factor is in only one 
of k states, such as in HMMs). For these single topic models, the work in Anandkumar et al. (2012) 
shows how this method is quite general in that the noise model is essentially irrelevant, making it 
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applicable to both discrete models like HMMs and certain Gaussian mixture models. 

The second setting is the body of algebraic methods used for the problem of blind source separa- 
tion (Cardoso and Comon, 1996). These approaches rely on tensor decomposition approaches (see 
Comon and Jutten (2010)) tailored to independent source separation with additive noise (usually 
Gaussian). Much of literature focuses on understanding the effects of measurement noise (without 
assuming knowledge of their statistics) on the tensor decomposition, which often requires more 
sophisticated algebraic tools. 

Frieze et al. (1996) also utilize these ideas for learning the columns of a linear transformation 
(in a noiseless setting). This work provides a different efficient algorithm, based on a certain ascent 
algorithm (rather than joint diagonalization approach, as in (Cardoso and Comon, 1996)). 

The underlying insight that our method exploits is that we have exchangeable (or multi-view) 
variables, e.g., we have multiple words (or sentences) in a document, which are drawn independently 
from the same hidden state. This allows us to borrow from both the ideas in Chang (1996) 
and in Cardoso and Comon (1996). In particular, we show that the "topic" modeling problem 
exhibits a rather simple algebraic solution, where only two SVDs suffice for parameter estimation. 
Moreover, this approach also simplifies the algorithms in Mossel and Roch (2006); Hsu et al. 
(2009); Anandkumar et al. (2012), in that the eigenvector methods are no longer necessary (e.g., 
the approach leads to methods for parameter estimation in HMMs with only two SVDs rather than 
using eigenvector approaches, as in previous work). 

Furthermore, the exchangeability assumption permits us to have arbitrary noise models (rather 
than additive Gaussian noise, which are not appropriate for multinomial and other discrete distri- 
butions). A key technical contribution is that we show how the basic diagonalization approach can 
be adapted to Dirichlet models, through a rather careful construction. This construction bridges 
the gap between the single topic models (as in Chang (1996); Anandkumar et al. (2012)) and the 
independent factor model. 

More generally, the multi-view approach has been exploited in previous works for semi-supervised 
learning and for learning mixtures of well-separated distributions (e.g., as in Ando and Zhang 
(2007); Kakade and Foster (2007); Chaudhuri and Rao (2008); Chaudhuri et al. (2009)). These 
previous works essentially use variants of canonical correlation analysis (Hotelling, 1935) between 
two views. This work shows that having a third view of the data permits rather simple estimation 
procedures with guaranteed parameter recovery. 

2 The Exchangeable and Multi-view Models 

We have a random vector h = (hi, /12, . . . , hk) T € This vector specifies the latent factors (i.e., 
the hidden state), where hi specifies the value taken by i-th factor. Denote the variance of hi as 

af = E[(h i -E[h 1 ]) 2 ] 

which we assume to be strictly positive, for each i, and denote the higher Z-th central moments of 
hi as: 

■= E[(hi-E[hi]) 1 ] 
At most, we only use the first four moments in our analysis. 
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Suppose we also have a sequence of exchangeable random vectors {x±, X2, x$, X4, . . . } S M. d ; these 
are considered to be the observed variables. Assume throughout that d> k; that x±, X2, £3, X4, . . . £ 
R d are conditionally independent given h; and there exists a matrix O G M. dxk such that 



for each {1,2,3,4,...}. Throughout, we make the following assumption. 
Assumption 2.1. O is full rank. 

This is a mild assumption, which allows for identifiability of the columns of O. The goal is to 
estimate the matrix O, sometimes referred to as the topic matrix. 

Importantly, we make no assumptions on the noise model. In particular, we do not assume that 
the noise is additive (or that the noise is independent of h). 

2.1 Independent Latent Factors 

Here, suppose that h has a product distribution, i.e., each component of hi is independent from 
the rest. Two important examples of this setting are as follows: 

(Multiple) mixtures of Gaussians: Suppose x v = Oh + 77, where r/ is Gaussian noise and h 
is a binary vector (under a product distribution). Here, the i-th column Oi can be considered to 
be the mean of the i-th Gaussian component. This is somewhat different model than the classic 
mixture of fc-Gaussians, as the model now permits any number of Gaussians to be responsible for 
generating the hidden state (i.e., h is permitted to be any of the 2 k vectors on the hypercube, while 
in the classic mixture problem, only one component is responsible. However, this model imposes 
the independent factor constraint.). We may also allow 7/ to be heteroskedastic (i.e., the noise may 
depend on h, provided the linearity assumption E^j/i] = Oh holds.) 

(Multiple) mixtures of Poissons: Suppose [Oh]j specifies the Poisson rate of counts for 
[x v ]j. For example, x v could be a vector of word counts in the v-th sentence of a document (where 
xi,X2,-.. are words counts of a sequence sentences). Here, O would be a matrix with positive 
entries, and hi would scale the rate at which topic % generates words in a sentence (as specified by 
the i-th column of O). The linearity assumption is satisfied as E[x„|/i] = Oh (note the noise is not 
additive in this case). Here, multiple topics may be responsible for generating the words in each 
sentence. This model provides a natural variant of LDA, where the distribution over h is a product 
distribution (while in LDA, h is a probability vector). 

2.2 The Dirichlet Model 

Now suppose the hidden state h is a distribution itself, with a density specified by the Dirichlet 
distribution with parameter a € (a is a strictly positive real vector). We often think of h as a 
distribution over topics. Precisely, the density of h 6 A fc ~ 1 (where the probability simplex A k ~ 1 
denotes the set of possible distributions over k outcomes) is specified by: 



E[x v \h] = Oh 



1 



k 



p a (h) : 



Z[p) 



IK 1 



-1 



i=l 



where 



Z(a): 



niir(a») 

r(ao) 
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and 

a := a.\ + a 2 H + afc • 

Intuitively, «o (the sum of the "pseudo-counts" ) is a crude measure of the uniformity of the distri- 
bution. As «o - ► 0, the distribution degenerates to one over pure topics (i.e., the limiting density 
is one in which, with probability 1, precisely one coordinate of h is 1 and the rest are 0). 

Latent Dirichlet Allocation: LDA makes the further assumption that each random variable 
X2 , X3, . . . takes on discrete values out of d outcomes {e.g., x v represents what the v-th word in 
a document is, so d represents the number of words in the language). Each column of O represents 
a distribution over the outcomes {e.g., these are the topic probabilities). The sampling procedure 
is specified as follows: First, h is sampled according to the Dirichlet distribution. Then, for each 
v, independently sample i £ {1,2,... k} according to h, and, finally, sample x v according to the 
i-th column of O. Observe this model falls into our setting: represent x v with a "hot" encoding 
where [x v ]j = 1 if and only if the v-th outcome is the j-th word in the vocabulary. Hence, 
Pr([x,;]j = l\h) = [Oh]j and Efa^/i] = Oh. (Again, the noise model is not additive). 

2.3 The Multi-View Model 

The multi-view setting can be considered an extension of the exchangeable model. Here, the random 
vectors {xi, X2, X3, ■ ■ ■ } are of dimensions d±, d 2 , d^, . . . . Instead of a single O matrix, suppose for 
each v £ {1, 2, 3, ... } there exists an O v G R d ^ xk such that 

E[x v \h] = O v h 

Throughout, we make the following assumption. 

Assumption 2.2. O v is full rank for each v. 

Even though the variables are no longer exchangeable, the setting shares much of the statisti- 
cal structure as the exchangeable one; furthermore, it allows for significantly richer models. For 
example, Anandkumar et al. (2012) consider a special case of this multi-view model (where there 
is only one topic present in h) for the purposes of learning hidden Markov models. 

A simple factorial HMM: Here, suppose we have a time series of random hidden vectors 
h\, hi, hz, . . . and observations x±, x 2 , X3, . . . (we slightly abuse notation as h\ is a vector). Assume 
that each factor £ {—1, 1}. The model parameters and evolution are specified as follows: We 
have an initial (product) distribution over the first h\. The "factorial" assumption we make is 
that each factor [ht]i evolves independently; in particular, for each component i, there are (time 
independent) transition probabilities pi^^i andp^i^^i. Also suppose that E[rr f |/it] = Oht (where, 
again, O does not depend on the time). 

To learn this model, consider the first three observations xi,x 2 ,x^. We can embed this three 
timestep model into the multiview model using a single hidden state, namely h 2 , and, with an 
appropriate construction (of 0\ , 2 , 03 and means shifts of x v to make the linearity assumption 
hold). Furthermore, if we recover Oi,0 2 ,Os we can recover O and the transition model. See 
Anandkumar et al. (2012) for further discussion of this idea (for the single topic case). 

3 Identifiability 

The underlying question here is: what may we hope to recover about O with only knowledge of the 
distribution on x\, x 2 , X3, .... At best, we could only recover the columns of O up to permutation. 
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At the other extreme, suppose no a priori knowledge of the distribution of h is assumed (e.g., it 
may not even be a product distribution). Here, at best, we can only recover the range of O. In 
particular, suppose h is distributed according to a multivariate Gaussian, then clearly the columns 
of O are not identifiable. To see this, transform O to OM (where M is any k x k invertible 
matrix) and transform the distribution on h (by M _1 ); after this transformation, the distribution 
over x v is unaltered and the distribution on h is still a multivariate Gaussian. Hence, O and OM 
are indistinguishable from any observable statistics. (These issues are well understood in setting 
of independent source separation, for additive noise models without exchangeable variables. See 
Comon and Jutten (2010)). 

Thus, for the columns of O to be identifiable, the distribution on h must have some non-Gaussian 
statistical properties. We consider three cases. In the independent factor model, we consider the 
cases when h is skewed and when h has excess kurtosis. We also consider the case that h is Dirichlet 
distributed. 

4 Excess Correlation Analysis (ECA) 

We now present exact and efficient algorithms for recovering O. The algorithm is based on two 
singular value decompositions: the first SVD whitens the data (based on the correlation between 
two variables) and the second SVD is carried out on higher order moments (based on third or 
fourth order). We start with the case of independent factors, as these algorithms make the basic 
diagonalization approach clear. 

As discussed in the Introduction, these approaches can been seen as extensions of the method- 
ologies in Chang (1996); Cardoso and Comon (1996). Furthermore, as we shall see, the Dirichlet 
distribution bridges between the single topic models (as in Chang (1996); Anandkumar et al. (2012)) 
and the independent factor model. 

Throughout, we use A + to denote the pseudo-inverse: 



for a matrix A with linearly independent columns (this allows us to appropriately invert non-square 
matrices) . 

4.1 Independent and Skewed Latent Factors 

Denote the pairwise and threeway correlations as: 



The dimensions of Pairs and Triples are d 2 and d 3 , respectively. It is convenient to project Triples 
to a matrix as follows: 



A + = (A T A)- X A 



(1) 



(i 

Pairs 
Triples 



E[xi] 

E[(x 1 -/i)(x 2 -/i) T ] 

E[(xi — n) ® (x2 - n) <8> (x3 - n)] 



Triples(r/) 



E[(xi - n)(x 2 - f-i) T (v, X3 - m)] 
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Algorithm 1 ECA, with skewed factors 



Input: vector 8 G R fc ; the moments Pairs and Triples (rj) 

1. Dimensionality Reduction: Find a matrix U G M, dxk such that 



Range(?7) = Range(Pairs). 



2 



(See Remark 1 for a fast procedure.) 

Whiten: Find V G M fcxfc so V T (U T Pairs f7)V is the x identity matrix. Set: 



W = UV 



3 



SVD: Let A be the set of (left) singular vectors, with unique singular values, of 



W T Trip\es(W9)W 



4 



Reconstruct: Return the set O: 



= { {W + ) T \ : AG A} 



where W + is the pseudo- inverse (see Eq 1). 



Roughly speaking, we can think of Triples^) as a reweighing of a cross covariance (by (rj, x% — //}). 

In addition to O not being identifiable up to permutation, the scale of each column of O is also 
not identifiable. To see this, observe the model over Xj is unaltered if we both rescale any column 
Oi and appropriately rescale the variable hi- Without further assumptions, we can only hope to 
recover a certain canonical form of O, defined as follows: 

Definition 1 (The Canonical O). We say O is in a canonical form if, for each i, o~f = 1. In par- 
ticular, the transformation O <s— Odiag(o"i,cj2, . . . , £>"&) (and a rescaling of h) places O in canonical 
form, and the distribution over x\, X2, X3, . . . is unaltered. Observe the canonical O is only specified 
up to the sign of each column (any sign change of a column does not alter the variance of hi). 

Recall is the central third moment. Denote the skewness of hi as: 



Theorem 4.1 (Independent and skewed factors). We have that: 

• (No False Positives) For all 9 G M. k , Algorithm 1 returns a subset of the columns of O, in a 
canonical form. 

• (Exact Recovery) Assume 7, is nonzero for each i. Suppose 9 G M fc is a random vector 
uniformly sampled over the sphere With probability 1, Algorithm 1 returns all columns 
of O, in a canonical form. 




The first result considers the case when the skewness is non-zero. 



S 



The proof of this theorem is a consequence of the following lemma: 
Lemma 4.1. We have: 

Pairs = diag(cr^, a\, . . . , (jf.)0 T 
Triples^) = diag(0 T 7?) diag(^i i3 , /j 2 ,3, • • • , ^k,z)0 T 

The proof of this Lemma is provided in the Appendix. 

Proof of Theorem 4-1- The analysis is with respect to O it its canonical form. By the full rank 
assumption, U T Pairs U, which is a k x k matrix, is full rank; hence, the whitening step is possible. 
By construction: 

I = W T Pairs W 

= W T Odmg(alal 
= {W t O)(W t O) t 
:= MM T 

where M := W T 0. Hence, M is a k x k orthogonal matrix. 
Observe: 

W T Triples{W6)W = W t O di&g(0 T W9) diag( 7l , 72 , . . . , i k )0 T W 
= M diag(M T #) diag( 7 i, 72, . . . , 7fc )M T 

Since M is an orthogonal matrix, the above is a (not necessarily unique) singular value decomposi- 
tion of W T Triples (W9)W. Denote the standard basis as e±, e2, . . . e^. Observe that Mei, . . . Me k 
are singular vectors. In other words, W T 0\, . . . W T Ok are singular vectors, where Oi is the i-th 
column of O. 

An SVD uniquely determines all singular vectors (up to sign) which have unique singular values. 
The diagonal of the matrix diag(M T 6 l ) diag(7i, 72, . . . , Jk) is the vector diag(7i, 72, . . . , 7fc)M T #. 
Also, since M is a rotation matrix, the distribution of MO is also uniform on the sphere. Thus, if 9 
is uniformly sampled over the sphere, then every singular value will be nonzero (and distinct) with 
probability 1. Finally, for the reconstruction, we have 

W{W T W)- 1 Me i = W(W T W)~ 1 W T O i = o h 

since VF(VF T VF) _1 VF T is a projection operator (and the range of W and O are identical). □ 

Remark 1 (Finding Range(Pairs) efficiently). Suppose O G M. dxk is a random matrix with entries 
sampled independently from a standard normal. Set U = Pairs©. Then, with probability 1, 
Range (U) = Range (Pairs). 

Remark 2 (No false positives) . Note that if the skewness is for some i then ECA will not recover 
the corresponding column. However, the algorithm does succeed for those directions in which the 
skewness is non-zero. This guarantee also provides the practical freedom to run the algorithm with 
multiple different directions 9, since we need only to find unique singular vectors (which may be 
easier to determine by running the algorithm with different choices for 9). 



...,a 2 k )0 T W 
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Algorithm 2 ECA; with kurtotic factors 

Input: vectors 9,9' G M. k ; the moments Pairs and Quadruples(r/, rf) 

1. Dimensionality Reduction: Find a matrix U G M. dxk such that 

Range(?7) = Range(Pairs). 

2. Whiten: Find V G R fexfe so F T ([/ T Pairs U)V is the fc x k identity matrix. Set: 

W = UV 

3. SVD: Let A be the set of (left) singular vectors, with unique singular values, of 

W T Quadruples^, W9')W 

4. Reconstruct: Return the set O: 

6 = { (W + ) T X : A G A} 
where W + is the pseudo-inverse (see Eq 1). 

Remark 3 (Estimating the skewness). It is straight forward to estimate the skewness corresponding 
to any column of O. Suppose A is some unique singular vector (up to sign) found in step 3 of ECA 
(which was used to construct some column Oj), then: 

7i = \ T W T Triples(WX)W\ 

is the corresponding skewness for 0%. This follows from the proof, since A corresponds to some 
singular vector Mei and: 

(M ei ) T Mdiag(M T M ei )diag(7i,7 2 , . . . , 7fc )M T M ei = 7i 

using that M is an orthogonal matrix. 

4.2 Independent and Kurtotic Latent Factors 

Define the following matrix: 

Quadruples(r/, rf) := E[(xi - fj,)(x 2 - h) t {t],x 3 - //)(?/, x 4 - fj)] 

— (r] T Pairs rf) Pairs —(Pairs rf) (Pairs rj') T — (Pairs rf) (Pairs rj) T 

This is a subspace of the fourth moment tensor. 

Recall //j4 is the central fourth moment. Denote the excess kurtosis of hi as: 

— ^> 4 _ •? 
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For Gaussian distributions, recall the kurtosis is 3, and so the excess kurtosis is 0. This function is 
also common in the source separation approaches (Hyvarinen et al., 2001) 1 . 

In settings where the latent factors are not skewed, we may hope that they are differentiated 
from a Gaussian distribution due to their fourth order moments. Here, Algorithm 2 is applicable: 

Theorem 4.2 (Independent and kurtotic factors). We have that: 

• (No False Positives) For all 9,9' € Algorithm 2 returns a subset of the columns of O, in 
a canonical form. 

• (Exact Recovery) Assume K{ is nonzero for each i. Suppose 9,9' £ ]R fc are random vectors 
uniformly and independently sampled over the sphere S k ~ 1 . With probability I, Algorithm 2 
returns all the columns of O, in a canonical form. 

Remark 4 (Using both skewed and kurtotic ECA). Note that both algorithms never incorrectly 
return columns. Hence, if for every i, either the skewness or the excess kurtosis is nonzero, then 
by running both algorithms we will recover O. 

The proof of this theorem is a consequence of the following lemma: 
Lemma 4.2. We have: 
Quadruples^, rf) = O diag(0 T ?y) diag(0 T 7/) diag (7/1,4 — 3af, 7/2,4 — 3cr|, . . . , 7/^4 — 3af.)O r 

The proof of this Lemma is provided in the Appendix. 
Proof of Theorem 1^.2. The distinction from the argument in Theorem 4.1 is that: 

W T Quadruples^, W9'))W = W t O diag(0 T W9) diag(O x W9') diag(«i, k 2 , n k )0 T W 

= Mdiag(M T #) diag(M T 0') diag(/e l9 k 2 ,..., K k )M T 

The remainder of the argument follows that of the proof of Theorem 4.1. □ 
4.3 Latent Dirichlet Allocation 

Now let us turn to the case where h has a Dirichlet density, where, each hi is not sampled in- 
dependently. Even though the distribution on h is the product of hf 1 ~ l ,...hf k ~ l , the h^s are 
not independent due to the constraint that h lives on the simplex. These dependencies suggest a 
modification for the moments to be used in ECA, which we now provide. 

Suppose ao is known. Recall that ao := «i + a 2 + • • • + a k (the sum of the "pseudo-counts"). 
Knowledge of ao is significantly weaker than having full knowledge of the entire parameter vector 
a. A common practice is to specify the entire parameter vector a in a homogeneous manner, with 
each component being identical (see Steyvers and Griffiths (2006)). Here, we need only specify the 
sum, which allows for arbitrary inhomogeneity in the prior. 

x Their algebraic method require more effort due to the additive noise and the lack of exchangeability. Here, the 
exchangeability assumption simplifies the approach and allows us to address models with non-additive noise (as in 
the Poisson count model discussed in the Section 2. 
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Denote the mean as 

fi = E[xi] 

Define a modified second moment as 



a T 



Pairs Q0 := E [xix 2 \ — W 

«0 + 1 

and a modified third moment as 

Triples (77) := E[xixJ(r?, 23)] ^— ( E[xixJ]r?/x T + /ir/ T E[xixJ] + (77, /i)E[xixJ 

2«q T 
(a + 2)(a + 1) 

Remark 5 (Central vs Non-Central Moments). In the limit as ao — > 0, the Dirichlet model degen- 
erates so that, with probability 1, only one coordinate of h equals 1 and the rest are (e.g., each 
document is about 1 topic). Here, we limit to non-central moments: 

lim Pairs Q „ = Efxi^cJ] hm Triples^ (77) = E[xixJ(rj, £3)] 

In the other extreme, the behavior limits to the central moments: 

lim Pairs ao = E[(x x - (i)(x 2 - /j) t ] lim Triples^ (77) = E[(x x - /j)(x 2 - /j) t (t?, (x 3 - (J,))] 

(to prove the latter claim, expand the central moment and use that, by exchangeability, EfxixJ] = 
E[x 2 xl] = EfxixJ]). 

Our main result here shows that EC A recovers both the topic matrix O, up to a permutation 
of the columns (where each column represents a probability distribution over words for a given 
topic) and the parameter vector a, using only knowledge of ao (which, as discussed earlier, is a 
significantly less restrictive assumption than tuning the entire parameter vector). Also, as discussed 
in Remark 8, the method applies to cases where x v is not a multinomial distribution. 

Theorem 4.3 (Latent Dirichlet Allocation). We have that: 

• (No False Positives) For all 6 G M. k , Algorithm 3 returns a subset of the columns of O. 

• (Topic Recovery) Suppose 6 G R fc is a random vector uniformly sampled over the sphere 5 fe_1 . 
With probability 1, Algorithm 3 returns all columns of O. 

• (Parameter Recovery) We have that: 

a = «o(ao + 1)0 + Pairs a0 (O + ) T 1 
where 1 6 M k is a vector of all ones. 
The proof is a consequence of the following lemma: 
Lemma 4.3. We have: 



Pairs Qo = - — ^ — Odiag(a)0 1 



1 

(a + l)«o 

and 



2 

(a + 2)(a + l)a 



Triples Q() (77) = - ^-^7 r~T\ — O diag(0 T r?) diag(a)O n 
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Algorithm 3 ECA for latent Dirichlet allocation 



Input: a vector 9 G IR fc ; the moments Pairs Q0 and Triples Qo 

1. Dimensionality Reduction: Find a matrix U G M, dxk such that 

Range(c^) = Range(Pairs ao ). 
(See Remark 1 for a fast procedure.) 

2. Whiten: Find V G M fcxfc so y T (C/ T Pairs ao f7)V is the hi identity matrix. Set: 

w = uv 

3. SVD: Let A be the set of (left) singular vectors, with unique singular values, of 

W T Trip\eB ao {W6)W 

4. Reconstruct and Normalize: Return the set O: 



(w+yx :A£A 



1 T (IF+) T A 

where 1 G M. d is a vector of all ones and W + is the pseudo- inverse (see Eq 1) 



The proof of this Lemma is provided in the Appendix. 
Proof of Theorem 4-3. Note that with the following rescaling of columns: 

O = 1 =0 diag( v / aT, y/ai, y/a^) 

V (°o + l)"o 

we have that h is in canonical form (i.e., the variance of each hi is 1). The remainder of the proof 
is identical to that of Theorem 4.1. The only modification is that we simply normalize the output 
of Algorithm 1. Finally, observe that claim for estimating a holds due to the functional form of 
Pairs ao . □ 

Remark 6 (Limiting behaviors). ECA seamlessly blends between the single topic model («o ~~ * 0) 
of Anandkumar et al. (2012) and the skewness based ECA, Algorithm 1 (ao — > oo). In the single 
topic case, Anandkumar et al. (2012) provide eigenvector based algorithms. This work shows that 
two SVDs suffice for parameter recovery. 

Remark 7 (Skewed and Kurtotic ECA for LDA). We conjecture that the fourth moments can be 
utilized in the Dirichlet case such that the resulting algorithm limits to the kurtotic based ECA, 
when ao — > oo. Furthermore, the mixture of Poissions model discussed in Section 2 provides a 
natural alternative to the LDA model in this regime. 

Remark 8 (The Dirichlet model, more generally). It is not necessary that we have a multinomial 
distribution on x v , so long as E[x t ,|/i] = Oh. In some applications, it might be natural for the 
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Algorithm 4 ECA; the multi-view case 

Input: vector 9 G the moments Pairs„„' and Triples^ (r?) 

1. Project views 1 and 2: Find matrices A G IR*-'^ 1 and B G R kxd 2 suc h that APairsi2 B 
is invertible. Set: 

Pairsi2 := APairsi2-B T 
Pairs3i := Pairs3i A T 
Pairs32 := Pairs32 B T 
Tri P les i32( ? ?) : = ^ Tri P 1 esi32(??)-B T 

(See Remark 10 for a fast procedure.) 

2. Symmetrize: Reduce to a single view: 

Pairs3 := Pairs3i(Pairs 12 ) _1 Pairs23 
Triples 3 (77) := Pairs32 (Pairsi2 ) ~ 1 Triples 132 (77) (Pairsi2 ) ~ 1 Pairsi3 

3. Estimate 3 with ECA: Call Algorithm 1, with 6, Pairs3, and Triples 3 (77) . 



observations to come from a different distribution (say x v may represent pixel intensities in an 
image or some other real valued quantity). For this case, where h has a Dirichlet prior (and where 
x v may not be multinomial), ECA still correctly recovers the columns of O. Furthermore, we need 
not normalize; the set {(VF + ) T A : A £ A} recovers O in a canonical form. 

4.4 The Multi-View Extension 

Rather than O being identical for each x v , suppose for each v G {1, 2, 3, 4, ... } there exists an 
O v e R d - Xk such that 

E[x v \h] = O v h 

For v G {1,2,3}, define 

Pairs,,,,/ := E[(x v - fi)(x' v - fi) T ] 
Triples 132 (?7) := E[(xi - fi)(x 2 - /x) t (t7,x 3 - //}] 

We use the notation 132 to stress that Triples 132 (77) is a d\ x di sized matrix. 
Lemma 4.4. For v G {1, 2, 3} , 

Pairs„y = O v diag(af , cr|, . . . , cj1)0[, t 
Triples 132 (77) = Oi diag(0 3 r 77) diag(/ii, 3 , At2,3, • • • , Vk,z)Ol 



The proof for Lemma 4.4 is analogous to those in Appendix A. 
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These functional forms make deriving an SVD based algorithm more subtle. Using the methods 
in Anandkumar et al. (2012), eigenvector based method are straightforward to derive. However, 
SVD based algorithms are preferred due to their greater simplicity. The following lemma shows 
how the symmetrization step in the algorithm makes this possible. 

Lemma 4.5. For Pairs3 and Triples 3 (r/) defined in Algorithm 4, we have: 

Pairs 3 = 3 diag (erf , of , . . . , a\ ) Oj 
Triples 3 (?7) = 3 diag(0 3 n) diag(/ii, 3) M2,3, ■ ■ ■ , Mfc,3)°3 

Proof. Without loss of generality, suppose O v are in canonical form (for each i, o~f = 1). Hence, 
J 4Pairsi2-B T = AOi(B0 2 ) T ■ Hence, AO\ and B0 2 are invertible. Note that: 

Pairs 3 i A T (SPairs 2 i A T )- 1 J BPairs 23 = O s Oj A T (B0 2 Oj A T y 1 B0 2 0^ = 3 0^ 
which proves the first claim. The proof of the second claim is analogous. □ 

Again, we say that all O v are in a canonical form if, for each i, o~f = 1. 
Theorem 4.4 (The multi-view case). We have: 

• (No False Positives) For all 6 G M. k , Algorithm 4 returns a subset of 3 , in a canonical form. 

• (Exact Recovery) Assume that 7, is nonzero for each i. Suppose 6 G M k is a random vector 
uniformly sampled over the sphere S k ~ l . With probability 1, Algorithm 4 returns all columns 
°f 3 , in a canonical form. 

Proof of Theorem 4-4- The proof is identical to that of Theorem 4.1. □ 

Remark 9 (Simpler algorithms for HMMs). Mossel and Roch (2006); Anandkumar et al. (2012) 
provide eigenvector based algorithms for HMM parameter estimation. These results show that 
we can achieve parameter estimation with only two SVDs (see Anandkumar et al. (2012) for the 
reduction of an HMM to the multi-view setting). The key idea is the symmetrization that reduces 
the problem to a single view. 

Remark 10 (Finding A and B). Suppose 0,0' G M. dxk are random matrices with entries sampled 
independently from a standard normal. Set A = Pairsi^ and B = Pairs2,i ©'• With probability 
1, Range(A) = Range(Oi) and Range(i?) = Range(02), and the invertibility condition will be 
satisfied (provided that 0\ and 2 are full rank). 

5 Sample Complexity 

Let us now provide an efficient algorithm utilizing samples from documents, rather than ex- 
act statistics. The following theorem shows that the empirical version of EC A returns accurate 
estimates of the topics. Furthermore, each run of the algorithm succeeds with probability greater 
than 3/4 so the algorithm may be repeatedly run. Primarily for theoretical analysis, Algorithm 5 
uses a rescaling procedure (rather than explicitly normalizing the topics, which would involve some 
thresholding procedure; see Remark 11). 
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Algorithm 5 Empirical ECA for LDA 



Input: an integer k; an integer iV; vector 9 G the sum ao 

1. Find Empirical Averages: With ./V independent samples (of documents), compute the 
empirical first, second, and third moments. Then compute the empirical moments Pairs Q0 
and Triples^ (77) . 

2. Whiten: Let W = AT,' 1 / 2 G R dxk where A G R dxk is the matrix of the orthonormal left 
singular vectors of Pairs ao , corresponding to the largest k singular values, and £ G R kxk is 
the corresponding diagonal matrix of the k largest singular values. 

3. SVD: Let {vi,v%, . . . Vk} be the set of (left) singular vectors of 

W T Trip\es ao {W9)W 

4. Reconstruct and Scale: Return the set {Oi, O2, ■ ■ ■ Ok} where 

4 . : . , 

(a + 2)(Wv i yTriples ao (Wv i )Wv i 
(See Remark 11 for a procedure which explicitly normalizes Oi.) 



Theorem 5.1 (Sample Complexity for LDA). Fix 5 G (0, 1). Let p m - m = mhij ^ and let (Jk{0) de- 
mte me ^,, t <„on-^i s,n 3 u la r ml ue 0,0. Suppose ft* we o U „„ N > f^gMjgB) ' 



independent samples of xi, x% in the LDA model. With probability greater than 1 — 5, the follow- 
ing holds: for 6 G M fc sampled uniformly sampled over the sphere S k ~ l , with probability greater than 
3/4, Algorithm 5 returns a set {Oi, O2, ■ ■ ■ Ok} such that there exists a permutation a of {1, 2, ... k} 
(a permutation of the columns) so that for all i G {1,2, ... k} 

~ (ao + l) 2 fc 3 A + 7lmOT \ 

\\Oi - O a ^)\\ 2 < c — —3 j= 

where c is a universal constant. 



Remark 11 (Normalizing and l\ accuracy). An alternative procedure would be to just explicitly 
normalize O.;. If d large, to do this robustly, one should first set to the smallest elements and 
then normalize. The reason for clipping the smallest elements is related to obtaining low l\ error. 

Our theorem currently guarantees £2 norm accuracy of each column. Another natural error 
measure for probability distributions is the l\ error (the total variation error). Ideally, we would 
like the l\ error to be small with a number of samples does not depend on the dimension d (e.g., the 
size of the vocabulary). Unfortunately, in general, this is not possible. For example, in the simplest 
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case where k = 1 (i.e., every document is about the same topic), then this amounts to estimating 
the distributions over words for this topic; in other words, we must estimate a distribution over 
d, which may require £l(d) samples to obtain some fixed target ^i-error. However, this situation 
occurs only when the target distribution is near to uniform. If instead, for each topic, say most of 
the probability mass is contained within the most frequent (^effective words (for that topic), then it 
is possible to translate our £2 error guarantee into an i\ guarantee (in terms of (^effective)- 

6 Discussion: Sparsity 

Note that sparsity considerations have not entered into our analysis. Often, in high dimensional 
statistics, notions of sparsity are desired as this generally decreases the sample size requirements 
(often at an increased computational burden). 

Here, while these results have no explicit dependence on the sparsity level, sparsity is helpful 
in that it does implicitly affect the skewness (and the whitening) , which determines the sample 
complexity. As the model becomes less sparse, the skewness tends to 0. In particular, for the case 
of LDA, as ao — > 00 note that error increases (see Theorem 5.1). 

Perhaps surprisingly, the sparsity level has no direct impact on the computational requirements 
of a "plug-in" empirical algorithm (beyond the linear time requirement of reading the data in order 
to construct the empirical statistics). 
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A Analysis with Independent Factors 

Lemma A.l (Hidden state moments). Let z = h — E[/i]. For any vectors u, v £ 

E[zz T ] = diag(<7? , of, . . . , af) 
E[zz T (u,z)\ = diag(u)diag(// i)3 ,/i 2 ,3, • • • , A**j,3) 

and 

E[zz T (u, z)(v,z)] = diag(u) diag(v) diag(/xi ( 4 - 3of,// 2 ,4 - 3of, . . . ,/i fc)4 - 3of) 

+ (u T E[zz T ]v)E[zz T ] + {E[zz T ]u)(E[zz T ]v) T + (E[zz T ]v)(E[zz T ]u) T 

Proof. Let a, b, u and v be vectors. Since the {z{\ are independent and have mean zero, we have: 



E[(a,z)(b,z)] = E 



i=l ' ^i=l 



i=l 



i=l 



and 



E[(a,z)(6,z)(n,z)] = E 



(k \ / k \ / k \-i fc k 

i=l ' ^i=l ' ^i=l ' -I i=l i=l 



For the final claim, let us compute the diagonal and non-diagonal entries separately. First, 

E[ziZi(u,z)(v,z)] = E[y~]ujV k ZiZiZjZk] 



3,k 



U;V 



UjV. 



/'». ! + of ^ U j V j a j 



UiVifl. 



i4 - UiVi{af) 2 + afy^UjVjQ- 



For j ^ i 



UiVim^ - UiVi^i ) + (u T E[zz T ]v)a, 



E[ziZj(u,z)(v,z}] = E[y^u k viZiZjZ k zi] 



k,l 

= UiVjE[zfzj] + UjViE[zfzj] 
= UiVj of a 2 + uj Vi of o| 

= pE[^ T ]it]ipE[^ T ]t;]j + [E[zz T ]u] i [E[zz T ]7;] i 

The proof is completed by noting the (i, j)-th components of E[zz T (u, z)(v, z)\ agree with the above 
moment expressions. □ 



The proofs of Lemmas 4.1 and 4.2 follow. 
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Proof of Lemmas 1^.1 and 1^.2. By the conditional independence of {xi, £3} given h, 

E[xi] = OE[h] 

and 

E[(xi - n){x 2 - /i) T ] = E[E[(x 1 - ijl){x2 - V) r \h}] 

= E[E[(x 1 - M )|/ l ]E[(x 2 -^) T |/ l ]] 
= OE[(h - E[h])(h - E[h]) T }0 T 
= Odmg(alal...,a 2 k )0 T 

by Lemma A.l. 

Similarly, the (i, j)-th entry of Triples(r;) is 

E[(ei,xi - (j)(ej,X2 - /x)(r?,x 3 - //)] = E[E[(ej,xi - n)(ej,x 2 - fJ-){r],x 3 - fj)\h]] 

= E [E[(e i ,x l -fi)\h}- E[( ej ,x 2 -fi)\h]- E[( V , x 3 - M ) 
= E[(ei, 0(/i - E[/i]))( ej , 0(/t - E[/j]))(t7, 0(/i - E[ 
= E[{0 T ei, h - E[h}){0 T ej,h - E[h}){0 T r], h-E[ 
= ej O diag(0 T n) diag(/^ )3 , /i 2 ,3, ■ ■ ■ , Vk,3)0 T ej . 

The proof for Quadruples(ry, r/) is analogous. □ 

The proof for Lemma 4.4 is analogous to the above proofs. 

B Analysis with Dirichlet Factors 

We first provide the functional forms of the first, second, and third moments. With these, we prove 
Lemma 4.3. 

B.l Dirichlet moments 

Lemma B.l (Dirichlet moments). We have: 

E[h <8h] = t tt — (diag(a) + aa T ) 

[an + 1 an 



(a + l)a 
and 

k 



E[h ®h<g)h] = - -—y- — — f a (g) a (g) a + aj (ej (g) ej (g) a) + ai (a ® <g eA 

(a + 2)(a + l)a V 



i=l i=l 



+ ^ a, (ej <g> a <g ej) + 2 (e, <g ej (g) e, 

i=l l=\ 
Hence, for v £ U. k , 

E[(h (g h)] = — r — ((v, a)aa T + diag(a)v a T + av T diag(a) 

(a + 2)(a + l)a V 

+ (v, a) diag(a) + 2 diag(w) diag(a) 
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Proof. First, let us specify the following scalar moments. 

Univariate moments: Fix some i E [k], and let a' := a + p ■ ej for some positive integer p. 
Then 



Z(a) 

T(ai+p) r(a ) 



In particular, 



r(ai) T(a +p) 
(aj+p - l)(aj +p - 2) • • • aj 
(a +p- l)(a +p- 2) • • • <V 



(«j + l)a>i 



(a + l)a 

(ai + 2)(aj + l)a» 

(a + 2)(a + l)a ' 



Bivariate moments: Fix i, j E [fc] with i ^ j, and let a' := a + p- e« + q- Cj for some positive 
integers p and g. Then 



r(a ) 



In particular, 



Z(o0 
Z(a) 

r(a^ +p) • r(o!j + q) 

r(oi) • r(aj) r(ao + y + 9) 

((ai +p- l)(aj + p - 2) • • • a,) • ((aj + g- l)(aj + g - 2) • • • aj) 
(ao+p + q- l)(a + p + g - 2) • • • a 



E[/ii/ij] 



(a + l)a 
(aj + l)aj«j 



(a + 2)(a + l)a " 

Trivariate moments: Fix k € [k] all distinct, and let a' := a + + ej + e K . Then 

Z(a') 



l[^/ij/i«] 



Z(a) 

r(a t + 1) • r(a 3 + 1) • T(a K + 1) r(a ) 



r(at) • T(aj) • V{a K ) 



r(a + 3) 



QLi QLj Otfr 



(a + 2)(a + l)a ' 
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Completing the proof: The proof for the second moment matrix and the third moment 
tensor follows by observing that each component agrees with the above expressions. For the final 
claim, 

E[(h <g> h)(v,h)] = — — - — — — ((v,a)(a <g> a) + aavAei <g> a) + V ^amia <8> eA 

(a + 2){a + l)a V ~t ~[ 

k k 



8=1 8=1 
1 



+ y^ j ai(v, a)(ej ® e*) + 2) j ajVi(ei <8> a 

(v, a)aa T + diag(a)f a T + av T diag(a) 



(a + 2)(a + l)ao 

+ (v, a) diag(a) + 2 diag(u) diag(a" 
which completes the proof. □ 

B.2 The proof of Lemma 4.3 

Proof. Observe: 

E[xi] = OE[h] 

and 

EfxixJ] = E[E[x lX ^\h}] = OE[hh T ]0 T 
Define the analogous quantity: 

Pairs h = E[hh T ] ^—E[h]E\h} T 

«o + 1 



and so: 
Observe: 



Pairs ao = O Pairs^ O 1 



Pairsh = E[hh T ] — — aa 1 

diag(a) 



Hence, 



(a + l)a 

Pairs ao = OPairs/ l O T = - — Odiag(a)0 1 



(ao + l)a 

which proves the first claim. 
Also, define: 

Triples^) :=R[(h® h){v,h)] ^— (E[hh T ]vE[h] T + E[h]v T E[hh T ] + (v, E[/i])E[/i/i 1 



a + 2 



?n< 2 

+ ( ao + 2)K + i) ( °' E "' 1>EWE "' r 
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Since 

E[x 1 x^(i h x 3 )]=E[E[x 1 xl(r ] ,x 3 )\h]} 
= OE[hh T {r],Oh)]0 T 
= OE[hh T {0 T i],h)]0 T 

we have 

TViples ao (??) = O Triples^ (0 T i])0 T 
Let us complete the proof by showing: 

Triplet,) ^ (ao + 2)( 2 ao + 1)ao ^g(,)diag(a) 



Observe: 

2 



■ diag(u) diag(a) = E[(h<S>h)(v, h)]—-. — r — ( (v, a)aa T +diag(a)ua 1 

(an + 2 an + 1 an V 



(a + 2)(a + l)a ' (a + 2)(a + l)a 

+ av T diag(a) + (v, a) diag(a) 

Let us handle each term separately. First, 

{v,a)aa T = {v ,E[h])E[h]E[h} T 



(a + 2)(a + l)a ' (a + 2)(a + l) 

Also, since: 

r — diag(a) = E\hh T ] — r — aa 1 

(a + l)a &W L J (a + l)a 

we have: 

1 



diag(a)ua + av diag(a) + (v, a) diag(a" 



(a + 2)(a + l)a 

= (E\hh T ]va T + av T E\hh T ] + (v,a)E[hh T ]) - - r — (v,a)aa T 

a + 2 V / (a + 2)(a + l)a 

f E[hh T }vE[h] T + E[h]v T E[hh T ] + (v,E[h})E[hh T ]) - ^ -{v,E[h])E[h}E[h}~ 



a + 2V / ( ao + 2)(a + l) 

Hence, 

2 

diag(u) diag(a) = E[(h ® h){v, h}] 

E[hh T ]vE[h] T +E[h]v T E[hh T ] + (u,E[/i])E[/i/i n 



(a + 2) (an + l)a 

a 



a + 2 



In 2 



which proves the claim. □ 
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C Sample Complexity Analysis 

Throughout, we work in a canonical form. Define: 



6 := 1 =0 diag(^/al, y/a^, y/o£) 

V( a o + l ) a o 

Under this transformation, we have: 

Pairs Q0 = -. — — O diag(a)O r = OO t 

(a + l)a 

Using the definition: 

a (a + 1) 1 



li :=2 
we also have that : 



(a + 2) 2 ai 



Triples Q (77) = O diag(0 T r/) diag(7)(9 T 
Hence, we can consider ji to be the effective skewness. Let us also define: 



Since on < ao, we have that: 



p min := mm — 
1 a 



< H < 2- 



Vao + 2 y/pmm(ao + 2) 



Note that: 



and 



*k(0)J^f-<a k (0)<l 
a + l 



<Tl(P) < (71 (O) J— < 



\/ao + 1 V a o + 1 

where cr,-(-) denotes the j-th largest singular value. These lower bounds are relevant for lower 
bounding certain singular values in our analysis. 

We use ||M|| to denote the spectral norm of a matrix M. Let us suppose that for all r], 

||PairSo, — Pairs ao || = Ep 
|| Triples^ (77) - Triples Q0 (r/)|| < \\r)\\E T 

for some Ep and Et (which we set later). 
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C.l Perturbation Lemmas 

Let Pairso, . be the best rank k approximation to Pairs Q0 . We have that W, as defined in Algo- 



rithm 5, whitens Pairs, 



Let 



W T Pairs ao k W = I . 



W T Pairs Q0 W = ADA T 



be an SVD of W T Pairs Qo W, where A G 



xk . Define: 

W := WAL>" 1/2 ^ T 

and observe that W also whitens Pairs ao , i.e., 

W T Pairs ao W = {AD~ 1/2 A T ) T W T Pairs ao W(AD~ 1/2 A T ) = I 

Due to sampling error, the range(VF) may not equal the range(Pairs ao ). 
Define: 

M := W r O, M = W T 6 

Lemma C.l. Let Tlw be the orthogonal projection onto the range of W and II be the orthogonal 
projection onto the range of O. Suppose Ep < crfc( Pair Sq, )/2. We have that: 



\\M 
\\M 

\\W 
\W + 



+ 



\w 



\M -M 



\W + - W A 



in-n 



w 



= 1 

< 2 

< — 



< 



< 



< 



o- k (0) 

< 2<7l(6) 

< 3<n(0) 
4 



6ai (O) 

o-k{dy 

4 

a fc (0) 2 



E P 
Ep 
E P 



Proof. Since W whitens Pairs ao , we have MM T = W t OO t W = I and 



IMII = 1 



By Weyl's theorem (see Lemma E.l), 



\W\ 



< 



< 



CTfe(Pairs ao ) Ok (Pairs 



|Pairs QQ — Pairs QQ || a "fe(P a h'S Q , 0> 



o- k (oy 
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Also, W = WAD l / 2 A T so that M = AD l / 2 A T M and 



||M - AD 1/2 A T M\\ 
\M\\\\l- AD 1/2 A T \\ 

|I- Z? 1/2 ||||I + J D 1/2 | 
II-DII 



< 



< 



where we have used that D y and D is diagonal. 
We can bound this as follows: 



II- D\ 



< 
< 



l-ADA T \\ 
l-W T Pairs Q0 #| 



W T (Patis ao k - 
W|| 2 ||Pairs QOifc ■ 
W\\ 2 (\\P^: 0tk 

2 



Pairs Q0 )W|| 



Pairs^Q 
- Pairs, 



ii 1 1 P > <3'irsQ.Q 



Pairs 



W\\ (c7fc + i(Pairs Q!0 ) + ||Pairs ao — Pairs 



3 «0 



< 2||W|rl|Pairs 



Pairs, 



ao 



< 4- 



1 



<?k(0) 



-E P 



using Weyl's theorem in the second to last step. 

* < 2 and s 

IMII 2 < ||M|| 2 ||L>|| < 3. 



This implies ||I - D\\ < A^-^E P < 2 and so ||D|| < 3. Since M = AD l / 2 A T M, 



Again, by Weyl's theorem, 



\W~ 



ci(Pairs ao ) < o\ (Pairs ao ) + Ep < 1.5o"i(Pairs c 



1.5ai(0) s 



Using that W = WAD~ l / 2 A T , we have: 

Il^+H 2 < ||W + || 2 ||L>|| < 4.5ai(0) 2 

and 

\\W+-w + \\ < IIW+IIIH-D 1 / 2 !! < \\W + \\\\l-D\\ < 6ai{d) 



E P 



<Jk{0) 2 

which completes the argument for the first set of claims. 

We now prove the final claim. Let be the matrix of canonical angles between range(PairSo, ) 
and range(Pairs ao fc ). By Wedin's theorem (see Lemma E.3) (and noting that the A;-th singular 
value of Pairs ao k is greater than <T/ c (Pairs Q , )/2), we have 



Pairs Q( , — PairSc,,, k 
sinG < 2- r-^ 1 < 2 



I Pairs 



Pairs Qo || + 1 1 Pairs 



OtQ 



Pairs 



cj fc (Pairs ao ) 
Using Lemma E.4, ||IT — 



a k (Pairs ao ) 
sin G ||, which completes the proof. 



< 4 



E P 



a k (Pairs Q0 ) 
□ 
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Lemma C.2. Suppose Ep < o"fc(Pairs ao )/2. For \\8\\ = 1, we have: 

\\W T Tiiples ao (We)W - W T Tri^ ao (W6)W\\ < c| ; = == 77^77 + 



VPmin(a + 2) <T k (0) 2 <J k {Oy) 

where c is a universal constant. 
Proof. We have: 

\\W T Triples ao {W6)W - W T TrTples Qo (W9)W\\ < \\W T Triples Qo {W6)W - W T Triples aQ (W 9) W\\ 

+ \\W T Triples ao (W 6)W - W T T^p\^ ao {W9)W\\ 

For the second term: 

\\W T Triples ao (W9)W - W T Ti^^ ao (W9)W\\ < \\W\\ 2 \\ Triples ao (WO) - Triples^ {W9)\\ 

< \\W\\ 3 E T 

< = — Et 

~ CT k (Of 

For the first term, by expanding out the terms and using the bounds in Lemma C.l, we have: 

\\W T Triples aQ {W0)W - W T Titles ao (W9)W\\ 
= ||Mdiag(M T #)diag( 7 )M T - Mdiag(M T #) diag( 7 )M T || 

< ||Mdiag(A/ T ^)diag(7)M T - M diag(M T #) diag( 7 )M T || 
+ ||Afdiag((M-M) T 0)diag(7)M T || 

< ||Mdiag(M T 6»)diag(7)Af T - Mdiag(M T 6>) diag(7)M T || + max7;||M|| 2 ||M - M|| 

i 

< cmax7i||M - M\\ (2) 

i 

for some constant c (where the last step follows from expanding out terms). □ 
C.2 SVD Accuracy 

Let o~i and Vi denote the corresponding z-th singular value (in increasing order) and vector of 
W T Triples Qo (W0)W. Similarly, let Vi and &i denote the corresponding i-th singular value (in 

increasing order) and vector of M^ T Triples Qo (1^#)M^. For convenience, choose the sign of iii so that 

(vi,Vi) > 0. 

The following lemma characterizes the accuracy of the SVD: 

Lemma C.3 (SVD Accuracy). Suppose Ep < <7fc(Pairs Qo )/2. With probability greater than 1 — 5' , 
we have for all i : 

.. ... k 3 ^T2 ( E P E T \ 

\\Vi ~ Vi\\ < C I , == + 



VPmin(a + 2) a k (0) 2 a k (Of 



for some universal constant c. 
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First, let us provide a few lemmas. Let 

\\W T Tv[ples aQ (W9)W - W T Tr^ptes~ ao (W9)W\\ < E 
where the bound on E is provided in Lemma C.2. 
Lemma C.4. Suppose for all i 

ai>A 

\0~i - <7j+l| > A 



For all i, Vi and di, we have: 



\Vi - Vi\\ < 2 



VkE 



A-E 

where the sign of Vi chosen so (vi,Vi) > 0. 

Proof. Let cos(#) = (vi,Vi) (which is positive since we assume (vi,Vi) > 0). We have: 



Vi - Vi 



2(1 - cos(0)) =< 2(1 - cos 2 (6)) = 2sin 2 (#) 



By Weyl's theorem (see Lemma E.l) and by assumption, 

min \&i — o~j\ > A — E 

i 

and 

min \&i — o~j\ > min |cr, — o~j\ — E > A — E 

By Wedin's theorem (see Lemma E.2 applied to the split where Vi and Vi correspond to the subspaces 
Ui and Ui), 



| sin(0) |<^2^^<V2.^ 



A-E 



A-E 



□ 



Lemma C.5. Fix any 5 G (0, 1) and matrix A € M, kxk . Let 9 G M. k be a random vector distributed 
uniformly over S k ~ 1 . With probability greater than 1 — 5, we have 



min|(0,A(ej - e,-)}| > 



and 



mm\{6,Aei)\ > 



min^j \\A(ej - • 6 
min,- IIAedl • 5 



Proof. By Lemma D.2, for any fixed pair {i,j} Q [k] and /3 := 5o/y/e, 



Pr 



|(0,A( ei - ei ))|<P( ei - ei ) 



J_. A 



(I(l-(5 2 / e ) + ln(5 2 /e)) 



Similarly, for each i 
Pr 



< e ^Q(l-(J 2 /e) + ln(«5 2 /e))) < 5 - 
Let := 5//c 2 . The claim follows by a union bound over all (2) + k < k 2 possibilities. 



i<0,Aai<p ei |i--L.A: 

Vk \/e. 



□ 



29 



We now complete the argument. 

Proof of Lemma C.3. Choose A = diag(7i,72, • • • , 7fc)M T , where M = W T 0. The proof of Theo- 
rem 4.1 shows (ei, AO) are the singular values. Also the minimal singular value of A is greater than 
mim 7,' > , 1 , _ < (since MM T = I). Hence, we have: 

Oi > 



2fc 2 - 5 v / ao+2 ' 

U _ rr I •> 5 

|Oi 0- l+ l\ S 2^.2.5^ 

Suppose E < A/2. Here, 

IK - Vl \\< 2^-^ < 4— = 8 D 

Also, since \\i>i — v%\\ < 2 the above also holds for E 1 > A/2, which proves the first claim. □ 
C.3 Reconstruction Accuracy 

Lemma C.6. Suppose Ep < (Pairs ao )/2. With probability greater than 1 — 5', we have that for 
all i: 

\\Oi-l (w+y^w < 4 ao+ ^f A/ ^{E P ,E T } 

where {Oi, O2, ■ ■ ■ O^} is some permutation of the columns of O. 

Proof. First, observe that W whitens Pairs ao n^. To see, observe that = Uyy (since H\y 
is an orthogonal projection) and W = HwW = W; so 

W T (n w Pairs Q0 njr) W = {TlJyWy Pairs^ (H^W) = W T Pairs ao W = I 

Using the definition M = W t O = W T TlwO, we have: 

W T Triples ao (W 9)W = Mdiag(M T 0) diag( 7 i,7 2 , . . . ,7 fc )M T 

Since range(W / ) = r&nge(UwO) the proof of Theorem 4.1 shows that: 

U w Oi = (W + y Vi 

Define: 

* 2 



(ao + 2)(^) T THples^ 
Since Vi = M T ei are the singular vectors of W T Triples ao (W9)W, we have 

(Wvi) T Triples Q0 (^)^ = li 

and so: 



Zi 



CM 



(ao + l)«o 
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This implies Oi = ZiOi and so 



since U w Oi = (W + ) T Vi. 

Now let us bound the reconstruction error as follows: 

Qi-l {w + Yvi\\ 

Zi 

Oi - n w o l \\ + \\u w o l - i {w+y^w 

Zi 

UOi - U w Oi\\ + \\-^(W + ) T Vi - i (W+yvt 



< 



< 



< 



< 



< 



'Zi 

1 



z, 
1 



1 



1 



n - u w \\\\o t \\ + \\—(w + ) T Vi - — (w + ) T Vi\\ + II— (w + ) T Vi - {w + ) T v. 



n-nw|| + 



n-n w || + 



\w + \ 
\w+\ 



z. 
i 



z, 



1 



\Vi - Vi\\ + \\—W + - ^r- 

"Zi Zi 



Vi - Vi\\ + WttW + - —W + \\ + II— W + --rW 



z. 



Zi 



Zi 



^ „ \\w + \\.. . „ 1 
n - n w \\ h — - — \\vi - Vi\\ + 



ry II L L 1 1 1 

6% &i 



w + - w + 


+ ll^ + ll 


1 1 






z~% 



For bounding \-k — 4-|, first observe: 



|(^ l ) T TViples ao (^ l )^i; i - (^) T Triples ao (WO i )^| 

< \{Wv t ) T Ttitfe&^WvdWvi - {Wvi) T Triples ao (W^)^l 
+|(^) T Triples ao (^)^ - (Wvi) T Thp^ ao (Wvi)Wvi\ 

< c\\vi - Vi\\ max7j + ||W T Triples^ (W^W - W T Tii^ Qo (Wvi)W\\ 



where c is a constant and where that last step uses an argument similar to that of Equation 2 
(along with the bounds ||V7 t O|| = 1, ||M T Wj|| < 1 and ||M T {>j|| < 1). Continuing, 

\(Wvi) T Triples ao (Wvi)Wvi - (W$ i ) T Tti^ ao (W'vi)Wvi\ 

I Ep Ex \ 

< c\\\vi - Vi\\ max7j + c 2 77^7 + 



ypmin(ao + 2) a k (0) 2 a k (6)\ 

< k 3 ( Ep | E T \ 

~ C3 5'V& V VPmin(ao + 2) a k {6y a k {Of) 

using that ji < 2 , 1 in the last step (for constants ci, 02,03). 

(a +2) 
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The fourth term is bounded as follows: 
1 1 



\W + \ 



Q p + 2 
2 

k 3 (a + 2 



\(W Vi ) r Triples^WvJWvi - (^) T Triples ao (^)W^ 




< cAW 



< c 2 ai{0) — 



<5'V& V\/Pmin(ao + 2) a fc (0) 2 a fc (0) J 



for constants ci, 02,03. 
We have: 



11 < c^P = c. l( o)> (ao + 1) < 0^(6), ^ 



y Ctlj V Pmin 

(for a constant c), so for the second term: 

11^+11 11 ^ fc 3 (g + 2) / gp 

— 7; — bi — Vi\\ < cai(O)—- — — — ~ = — 

Zi " \^p min (a + 2) a k {0Y a k {Of ) 

The remaining terms can be show to be of lower order, so that: 

\\Oi-— (W + ) T Vi < ca 1 (Q)-^-^=^- — = ^ + i— 

Zi S'^p— \y Pmin (a + 2) a fc (0) 2 ^(O)^ 

< c ^ 3 V^+2 / (q + 2) 1 / 2 £ P + (q + 2) 3 / 2 £ t \ 
k 3 (a + 2) / £ P (a + 2)ff T \ 

^ wo) 2 a fc (o)3 ; 



C2 

^min 



using that a k {6) > a k {0)^^ and ffl (0) < □ 
C.4 Completing the proof 

Proof of Theorem 5.1. Lemma D.l and the definition of Pairs ao and Triples ao imply that: 



|Pairso, — Pairs Qo || < 3 



N 



Triples (rj) ~ Tri P les a fa) II < H 



N 

for a constant c (by expanding out the terms and by using 5/3 results in a total error proba- 
bihty of 5). Hence, we can take E P = E T = c l+ ^^ . Since N > ( ^^^ff 7 ^ ) 2 * 
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I 6+6 fp!,i!l ~^T I ) the condition < o"fc(Pairs an )/2 is satisfied. The proof is completed using 
Lemma C.6. □ 



D Tail Inequalities 



Lemma D.l (Lemma A.l in Anandkumar et al. (2012)). Fix 5 G (0, 1). Let xi,X2,x 3 are random 
variables in which ||xi||, ||x2||, ||x3|| are bounded by 1, almost surely. Let E[x\] be the empirical 
average of N independent copies of x\; let -EfxixJ] be the empirical average of N independent copies 
of x ixj ; let E\x\X2 {m x 3 )]. be the empirical average of N independent copies of x\x\ (r\,x 3 ). Then 



1. Pr 



2. Pr 



3. Pr 



\E[xi] ~ E[ Xl ]\\ F < '-±^1 



\E[ Xl xt] - E[ Xl xt]\\ F < 



> 1-5 

>l-6 



Vr? G R d , \\E[ Xl x^(ri,x 3 )} - E[ Xi x^( V ,x 3 )]\\f < h " 2(1+ ^™ ) 



> 1-5. 



Lemma D.2 (Dasgupta and Gupta (2003)). Let 6 G M. n be a random vector distributed uniformly 
over 5 n_1 , and fix a vector v G W 1 . 



1. If P G (0,1), then 



Pr 



KMI < IN 







<exp(^(l-/3 2 + ln/3 2 ; 



2. IfP> 1, then 



Pr 



n 



<exp(^(l-/3 2 + ln/3 2 ; 



Proof. This is a special case of Lemma 2.2 from Dasgupta and Gupta (2003). 



□ 



E Matrix Perturbation Lemmas 



Lemma E.l (Weyl's theorem; Theorem 4.11, p. 204 in Stewart and Sun (1990)). Let A, E G 

with m > n be given. Then 

max\ai(A + E) - Oi{A)\ < \\E\\. 



it=\n\ 



Lemma E.2 (Wedin's theorem; Theorem 4.1, p. 260 in Stewart and Sun (1990))). Let A,E£ 
with m > n be given. Let A have the singular value decomposition 



uj 
uj 



A [ V\ V 2 



Si 
E 2 




Here, we do not suppose Si and E2 have singular values in any order. Let A := A + E, with 
analogous singular value decomposition {U\, U2, U3, Si, E2, V1V2) (again with no ordering to the 
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singular values). Let $ be the matrix of canonical angles between range (i7i) and range (f/i), and G 
be the matrix of canonical angles between range (Vi) and range(Vi). Suppose there exists a 5 such 
that: 

[T,2]jj\ > 5 and min > 5, 



min|[Ei]ii 



then 



sin$||^ + ||sine||| < 2 ^} F . 



Lemma E.3 (Wedin's theorem; Theorem 4.4, p. 262 in Stewart and Sun (1990).). Let A,Ee 
with m > n be given. Let A have the singular value decomposition 



uj 



A [ V\ V 2 



Si 

s 2 





Let A := A + E, with analogous singular value decomposition (Ui, U2, U3, Si, S2, V\ V2). Let $ be 
the matrix of canonical angles between range(?7i) and range(£/i), and be the matrix of canonical 
angles between range(Vi) and range(Vi). Lf there exists 5, a > such that minj<7j(Si) > a + 5 and 
maxj<Tj(S2) < a, then 

max{|| sin <I>||2, || sin ©H2} < — ^ — . 



Lemma E.4. Let G be the matrix of canonical angles between range(X) and range(Y). Let Tlx 
and Hy be the orthogonal projections onto range(X) and range(y), respectively. We have: 

— Ily || = || sin 0|| 

Proof. See Theorem 4.5, p. 92, and Corollary 4.6, p. 93, in Stewart and Sun (1990). □ 
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F Illustrative empirical results 



We applied Algorithm 5 to the UCI "Bag of Words" dataset comprised of New York Times articles. 
This data set has 300000 articles and a vocabular of size d = 102660; we set k = 50 and ao = 0. 
Instead of using a single random 9 and obtaining singular vectors of W T Triples Q0 (W6)W, we used 
the following power iteration to obtain the singular vectors {vi,i>2j ■ ■ ■ > %-} : 

{vi, $2, ■ ■ ■ i Vk} 4— random orthonormal basis for R fe . 
Repeat: 

1. For i = 1,2,..., k: 

% i- W r T*iples ao (Wvi)Wvi. 

2. Orthonormalize {v\, V2, ■ ■ ■ , 

The top 25 words (ordered by estimated conditional probability value) from each topic are shown 
below. 
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